Optimizing Synchronization Operations for Remote Memory Communication Systems
Abstract
Synchronization operations, such as fences and locks, are used in many parallel operations that access shared memory. However, a process that is blocked waiting for a fence operation to complete, or for a lock to be acquired, cannot perform useful computation. It is therefore critical that these operations be implemented as efficiently as possible to reduce the time a process sits idle. These operations also affect the scalability of the overall system: as systems grow, the number of processes that may request a lock increases. In this paper we describe the design and implementation of an optimized operation that combines a global fence operation with a barrier synchronization operation. We also describe our implementation of an optimized lock algorithm. The optimizations have been incorporated into the ARMCI communication library. The combined global fence and barrier operation yields an improvement of up to a factor of 9 over the current implementation on a 16-node system, while the optimized lock implementation yields an improvement of up to a factor of 1.25. These optimizations allow for more efficient and scalable applications.
Similar resources
ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems
This paper introduces a new portable communication library called ARMCI. ARMCI provides one-sided communication capabilities for distributed array libraries and compiler run-time systems. It supports remote memory copy, accumulate, and synchronization operations optimized for non-contiguous data transfers including strided and generalized UNIX I/O vector interfaces. The library has been employe...
Analyses and Optimizations for Shared Address Space Programs
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, called cycle analysis, is based on work by Shasha and Snir and checks for cycles a...
Compiler-Assisted Distributed Shared Memory Schemes Using Memory-Based Communication Facilities
To execute shared-memory-based parallel programs efficiently, we introduce two compiler-assisted software cache schemes that are well suited to automatic optimization of remote communications. One scheme is a full user-level software cache (User-level Distributed Shared Memory: UDSM) and the other is a page-based cache (Asymmetric Distributed Shared Memory: ADSM) which exploits TLB/MMU only in ...
Optimizing Collective Communication on Multicores
As the gap in performance between processors and memory systems continues to grow, the communication component of an application will dictate the overall application performance and scalability. Therefore it is useful to abstract common communication operations across cores as collective communication operations and tune them through a runtime library that can employ sophisticated automa...
Memory-Based Communication Facilities and Asymmetric Distributed Shared Memory
In general-purpose parallel and distributed systems, performance of the protected and virtualized user-level communications and synchronizations is the most crucial issue to realize efficient execution environments. We proposed a novel high-speed user-level communication and synchronization scheme “Memory-Based Communication Facilities (MBCF)” for a general-purpose system with an off-the-shelf ...
Publication date: 2003